NExT-GPT: Any-to-Any Multimodal LLM
While Multimodal Large Language Models (MM-LLMs) have recently made exciting
strides, they mostly remain limited to input-side multimodal understanding,
without the ability to produce content in multiple modalities. Since humans
perceive the world and communicate with others through various modalities,
developing any-to-any MM-LLMs capable of accepting and delivering content in
any modality is essential for human-level AI. To fill this gap, we present an
end-to-end, general-purpose, any-to-any MM-LLM system,
NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion
decoders, enabling NExT-GPT to perceive inputs and generate outputs in
arbitrary combinations of text, images, videos, and audio. By leveraging the
existing, well-trained, high-performing encoders and decoders, NExT-GPT is
tuned with only a small number of parameters (1%) in certain projection layers,
which not only keeps training cost low but also facilitates convenient
expansion to additional potential modalities. Moreover, we introduce
modality-switching instruction tuning (MosIT) and manually curate a
high-quality dataset for MosIT, based on which NExT-GPT is empowered with
complex cross-modal semantic understanding and content generation. Overall, our
research showcases the promising possibility of building an AI agent capable of
modeling universal modalities, paving the way for more human-like AI research
in the community. Project page: https://next-gpt.github.io/
Comment: work in progress
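A minimal sketch of the projection-only tuning idea described above, assuming frozen pretrained modality encoders, a frozen LLM exposed through a Hugging Face-style `inputs_embeds` interface, and a frozen diffusion decoder; the class name, dimensions, and optimizer choice below are illustrative placeholders, not NExT-GPT's released code.

```python
import torch
import torch.nn as nn

class ProjectionBridge(nn.Module):
    """Trainable projections linking a frozen encoder, a frozen LLM and a frozen decoder."""

    def __init__(self, enc_dim=1024, llm_dim=4096, cond_dim=768):
        super().__init__()
        # Input projection: modality-encoder features -> LLM embedding space.
        self.in_proj = nn.Linear(enc_dim, llm_dim)
        # Output projection: LLM hidden states -> diffusion-decoder conditioning space.
        self.out_proj = nn.Linear(llm_dim, cond_dim)

    def forward(self, enc_feats, llm):
        # enc_feats: (batch, seq, enc_dim) from a frozen image/video/audio encoder.
        tokens = self.in_proj(enc_feats)
        # Assumes a Hugging Face-style model that accepts inputs_embeds.
        hidden = llm(inputs_embeds=tokens).last_hidden_state
        return self.out_proj(hidden)  # conditioning signal for a frozen diffusion decoder

bridge = ProjectionBridge()
# Only the projection parameters receive gradients; encoders, LLM and decoders stay
# frozen, which is what keeps the tuned fraction of parameters around 1%.
optimizer = torch.optim.AdamW(bridge.parameters(), lr=1e-4)
```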
Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization
We investigate composed image retrieval with text feedback. Users gradually
look for the target of interest by moving from coarse to fine-grained feedback.
However, existing methods merely focus on the latter, i.e., fine-grained search,
by harnessing positive and negative pairs during training. This pair-based
paradigm only considers the one-to-one distance between a pair of specific
points, which is not aligned with the one-to-many coarse-grained retrieval
process and compromises the recall rate. In an attempt to fill this gap, we
introduce a unified learning approach that simultaneously models coarse- and
fine-grained retrieval by considering multi-grained uncertainty. The
key idea underpinning the proposed method is to integrate fine- and
coarse-grained retrieval as matching data points with small and large
fluctuations, respectively. Specifically, our method contains two modules:
uncertainty modeling and uncertainty regularization. (1) The uncertainty
modeling simulates multi-grained queries by introducing identically
distributed fluctuations in the feature space. (2) Based on the uncertainty
modeling, we further introduce uncertainty regularization to adapt the matching
objective according to the fluctuation range. Compared with existing methods,
the proposed strategy explicitly prevents the model from pushing away potential
candidates in the early stage, and thus improves the recall rate. On the three
public datasets, i.e., FashionIQ, Fashion200k, and Shoes, the proposed method
achieves +4.03%, +3.38%, and +2.40% Recall@50 accuracy over a strong
baseline, respectively.
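One way to read the two modules in code, under simplifying assumptions: Gaussian feature noise stands in for the identically distributed fluctuations, and an InfoNCE-style matching loss is re-weighted so that larger fluctuation ranges incur softer penalties; the exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def uncertainty_modeling(query_feats, sigma):
    # Simulate a coarser-grained query by adding identically distributed fluctuations.
    return query_feats + torch.randn_like(query_feats) * sigma

def uncertainty_regularized_loss(query_feats, target_feats, sigma, temperature=0.07):
    # Adapt the matching objective to the fluctuation range: the larger the
    # fluctuation (the coarser the query), the softer the penalty, so potential
    # candidates are not pushed away in the early, coarse-grained stage.
    q = F.normalize(uncertainty_modeling(query_feats, sigma), dim=-1)
    t = F.normalize(target_feats, dim=-1)
    logits = q @ t.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # diagonal = matching pairs
    per_pair = F.cross_entropy(logits, labels, reduction="none")
    weight = 1.0 / (1.0 + sigma ** 2)                   # down-weight high-uncertainty pairs
    return (weight * per_pair).mean()

# Coarse queries get a larger sigma, fine-grained queries a smaller one.
loss = uncertainty_regularized_loss(torch.randn(8, 512), torch.randn(8, 512), sigma=0.5)
```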
LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation
In the text-to-image generation field, recent remarkable progress in Stable
Diffusion makes it possible to generate a rich variety of novel photorealistic
images. However, current models still face misalignment issues (e.g.,
problematic spatial relation understanding and numeration failure) in complex
natural scenes, which impedes high-faithfulness text-to-image generation.
Although recent efforts have been made to improve controllability by giving
fine-grained guidance (e.g., sketches and scribbles), this issue has not been
fundamentally tackled since users have to provide such guidance information
manually. In this work, we strive to synthesize high-fidelity images that are
semantically aligned with a given textual prompt without any manually provided guidance. Toward
this end, we propose a coarse-to-fine paradigm to achieve layout planning and
image generation. Concretely, we first generate the coarse-grained layout
conditioned on a given textual prompt via in-context learning based on Large
Language Models. Afterward, we propose a fine-grained object-interaction
diffusion method to synthesize high-faithfulness images conditioned on the
prompt and the automatically generated layout. Extensive experiments
demonstrate that our proposed method outperforms the state-of-the-art models in
terms of layout and image generation. Our code and settings are available at
https://layoutllm-t2i.github.io.
Comment: Accepted by ACM MM 2023
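A minimal sketch of the coarse-to-fine pipeline described above: an LLM is prompted with in-context examples to plan object bounding boxes, and the parsed layout then conditions a layout-aware diffusion model. The in-context example, `llm_generate`, and `layout_conditioned_diffusion` are hypothetical placeholders, not the released LayoutLLM-T2I code.

```python
# Hypothetical in-context example showing the layout format the LLM should imitate.
IN_CONTEXT_EXAMPLES = (
    "Prompt: a cat sitting on a wooden table\n"
    "Layout: cat [0.30, 0.25, 0.70, 0.80]; table [0.05, 0.60, 0.95, 1.00]\n"
)

def plan_layout(prompt, llm_generate):
    # Coarse stage: ask the LLM for object boxes via in-context learning.
    request = f"{IN_CONTEXT_EXAMPLES}\nPrompt: {prompt}\nLayout:"
    raw = llm_generate(request)  # e.g. "dog [0.1, 0.4, 0.5, 0.9]; ball [0.6, 0.7, 0.8, 0.9]"
    layout = []
    for item in raw.split(";"):
        name, _, box = item.strip().partition(" [")
        coords = [float(x) for x in box.rstrip("]").split(",")]
        layout.append((name, coords))
    return layout

def generate_image(prompt, llm_generate, layout_conditioned_diffusion):
    # Fine stage: synthesize the image conditioned on both the prompt and the
    # automatically planned layout (no manual sketches or scribbles needed).
    layout = plan_layout(prompt, llm_generate)
    return layout_conditioned_diffusion(prompt=prompt, layout=layout)
```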